{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## WARNING: MAKE SURE YOU TERMINATE YOUR AWS INSTANCES WHEN YOU'RE DONE!!! OTHERWISE YOU WILL GET CHARGED (PROBABLY A LOT) BY AMAZON."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create an AWS account\n",
    "\n",
    "* Go to https://aws.amazon.com acn click on the \"Sign In to the Console\" button on the upper right.\n",
    "\n",
    "---\n",
    "\n",
    "## Create an IAM user\n",
    "\n",
    "1. Log in (or create a new account) and click on the _Identity & Access Management_ link under the __Security & Identity__ section.\n",
    "2. Select _Users_ on the menu on the left and click on the _Create New Users_ button.\n",
    "3. Use __cs109__ for the user name and make sure the _Generate an access key for each user_ checkbox is selected.\n",
    "4. Click the _Download Credentials_ button to get the new credential keys. Once the file is downloaded, click the _Close_ button.\n",
    "5. Click on the newly created __cs109__ user.\n",
    "6. Scroll down until you see the _Attach Policy_ button and click on it.\n",
    "7. Search and select the __AdministratorAccess__ policy, and click the _Attach Policy_ button.\n",
    "\n",
    "#### You should now have a file called __credentials.csv__ on your default download path and a new user with the necessary permissions to perform the next actions.\n",
    "\n",
    "---\n",
    "\n",
    "## Download and configure the AWS CLI tools\n",
    "\n",
    "1. Go to http://docs.aws.amazon.com/cli/latest/userguide/installing.html and find the instructions for the platform you're using.\n",
    "2. Run the following on the command line: ```aws configure```\n",
    "3. Fill out the requested information (replace the ??? bellow with the values from the credentials file):\n",
    "```\n",
    "AWS Access Key ID [None]: ???\n",
    "AWS Secret Access Key [None]: ???\n",
    "Default region name [None]: us-east-1\n",
    "Default output format [None]: json\n",
    "```\n",
    "4. Run the following on the command line: ```aws emr create-default-roles```\n",
    "\n",
    "#### You should get a big JSON string as the output of this command.\n",
    "\n",
    "---\n",
    "\n",
    "## Create an EC2 SSH key pair\n",
    "\n",
    "1. Run the following on the command line:\n",
    "\n",
    "```\n",
    "aws ec2 create-key-pair --key-name CS109 --query 'KeyMaterial' --output text > CS109.pem\n",
    "chmod 400 CS109.pem\n",
    "```\n",
    "\n",
    "#### You should now have a file called __CS109.pem__ on your current directory.\n",
    "\n",
    "---\n",
    "\n",
    "## Create the Spark cluster\n",
    "\n",
    "* Run the following on the command line:\n",
    "```\n",
    "export CLUSTER_ID=`aws emr create-cluster --name \"CS109 Spark cluster\" \\\n",
    "--ami-version 3.10 --applications Name=Spark --ec2-attributes KeyName=CS109 \\\n",
    "--instance-type m1.medium --instance-count 3 --use-default-roles \\\n",
    "--bootstrap-actions Path=s3://cs109-2015/install-anaconda,Name=Install_Anaconda \\\n",
    "--query 'ClusterId' --output text` && echo $CLUSTER_ID\n",
    "```\n",
    "\n",
    "The output of this command will be something like the following (your actual value will be different):\n",
    "```\n",
    "j-33S87OUETACNK\n",
    "```\n",
    "\n",
    "It will take a few minutes for your cluster to be ready. Go watch some cat videos on YouTube and come back in 10 minutes or so.\n",
    "\n",
    "---\n",
    "\n",
    "## Connect to the iPython Notebook:\n",
    "\n",
    "1. Get the cluster master's IP:\n",
    "```\n",
    "export DNS_NAME=`aws emr describe-cluster --cluster-id $CLUSTER_ID \\\n",
    "--query 'Cluster.MasterPublicDnsName' --output text` && echo $DNS_NAME\n",
    "```\n",
    "If the output of this command is __None__, your cluster is not ready yet. Wait a few minutes and run it again. Repeat until you get something different from __None__.\n",
    "\n",
    "2. Create an SSH tunel to the AWS box (this assumes your SSH key is on the same directory you are invoking the SSH command from).\n",
    "```\n",
    "ssh -o ServerAliveInterval=10 -i CS109.pem hadoop@$DNS_NAME -N -L 8989:localhost:8888\n",
    "```\n",
    "\n",
    "3. Open your browser and got to http://localhost:8989\n",
    "\n",
    "---\n",
    "\n",
    "## Run a Spark job\n",
    "\n",
    "1. Copy and paste the following on a notebook cell and run it:\n",
    "```\n",
    "!hdfs dfs -copyFromLocal /home/hadoop/spark/README.md /user/hadoop/\n",
    "```\n",
    "2. Now do the same with the following snippet:\n",
    "\n",
    "```\n",
    "text_file = sc.textFile(\"/user/hadoop/README.md\")\n",
    "\n",
    "word_counts = text_file \\\n",
    "    .flatMap(lambda line: line.split()) \\\n",
    "    .map(lambda word: (word, 1)) \\\n",
    "    .reduceByKey(lambda a, b: a + b)\n",
    "\n",
    "word_counts.collect()\n",
    "```\n",
    "\n",
    "---\n",
    "\n",
    "## Terminate the Spark cluster\n",
    "\n",
    "* Run the following on the command line:\n",
    "```\n",
    "aws emr terminate-clusters --cluster-ids $CLUSTER_ID\n",
    "aws ec2 delete-key-pair --key-name CS109\n",
    "rm CS109.pem\n",
    "```"
   ]
  },
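  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given the warning at the top, it is worth confirming that nothing is still running. A minimal sketch using standard AWS CLI commands, assuming `CLUSTER_ID` is still set in your shell:\n",
    "```\n",
    "# Block until the cluster has fully terminated\n",
    "aws emr wait cluster-terminated --cluster-id $CLUSTER_ID\n",
    "# List clusters that are not yet terminated; the output should be empty\n",
    "aws emr list-clusters --active --query 'Clusters[].[Id,Name,Status.State]' --output text\n",
    "```\n",
    "If the second command prints anything, terminate those clusters as well."
   ]
  },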
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
